Abstract One challenge for dialogue agents is to recognize the feelings of the
conversation partner and respond accordingly. In this work, RoBERTa-GPT2 is
proposed for empathetic dialogue generation, where the pre-trained auto-encoding
RoBERTa is used as the encoder and the pre-trained auto-regressive GPT-2 as the
decoder. By combining the pre-trained RoBERTa and GPT-2, our model achieves a
new state-of-the-art emotion accuracy. To give the RoBERTa-GPT2 model empathetic
ability, we propose a commonsense knowledge and emotional concepts extractor,
which extracts commonsense and emotional concepts from the dialogue context for
the GPT-2 decoder. The experimental results demonstrate that empathetic dialogue
generation benefits from both the pre-trained encoder-decoder architecture and
external knowledge.


1 Introduction 


With the development of Spoken Dialogue Systems (SDSs), people are no longer
satisfied with task-oriented interaction, such as booking a train ticket or
making a reservation, but are additionally interested in chit-chat
communication. An expected trait of chit-chat agents is the ability to identify
the user's emotion and express empathy. For instance, the psychology study in
[41] shows that talking about an emotional experience to someone and sharing
one's emotions contributes to emotional recovery from the event. Hence,
accurately identifying the user's emotion and appropriately expressing empathy
is a desired trait for SDSs.


Ye Liu, Wolfgang Maier, Stefan Ultes
Mercedes-Benz AG, Sindelfingen, Germany e-mail: ye.y.liu@daimler.com,
wolfgang.mw.maier@daimler.com, stefan.ultes@daimler.com


Wolfgang Minker
Ulm University, Ulm, Germany e-mail: wolfgang.minker@uni-ulm.de




Table 1 shows an empathetic dialogue from the EmpatheticDialogues dataset [27].
A speaker tells a listener about the lonely situation that they are facing, and
the listener tries to understand the speaker’s feelings and responds
accordingly. Even though sharing emotional experiences comes naturally to
humans, it is a great challenge to train a chit-chat agent capable of
understanding the user's emotion and responding empathetically.


Table 1 One empathetic dialogue in the EmpatheticDialogues dataset.

Emotion    Lonely
Situation  All my friends live in a different country
Speaker    Hi, I feel so lonely sometimes because all my friends live in a different country.
Listener   Oh, I’m sure you are lonely. Maybe you can join some kind of club that lets you
           meet new friends?
Speaker    I was thinking about it! I wanted to join a group for local moms.
Listener   That’s a good idea! This way you can also meet friends for yourself, but also maybe
           meet new friend’s for your children to hang out with while you do with their moms!


Several works with a Transformer-based encoder-decoder architecture [36] have
been presented for empathetic dialogue generation, such as multi-task learning
[26, 27, 37] or a mixture of experts [16]. However, the combination of a
pre-trained auto-encoding encoder and a pre-trained auto-regressive decoder has
not been explored for empathetic dialogue generation. In this work, we present
the RoBERTa-GPT2 encoder-decoder architecture for empathetic dialogue
generation, with the pre-trained Robustly optimized BERT approach (RoBERTa) [18]
as the encoder and the pre-trained Generative Pre-trained Transformer (GPT-2)
[25] as the decoder. The experiments on the EmpatheticDialogues dataset show
that the combination of RoBERTa and GPT-2 substantially improves the emotion
recognition ability and achieves a new state-of-the-art emotion accuracy.

In addition to the advanced neural network architecture, external knowledge
also contributes to empathetic dialogue generation. Humans generally understand
the world and express implicit emotions based on their experience and
knowledge. Also, [39] demonstrates that commonsense knowledge is fundamental
for chit-chat agents to understand conversations and generate appropriate
responses. As shown in Fig. 1, the underlying commonsense and emotional
concepts of the speaker utterance can help the listener to better understand
what the speaker is talking about. Hence, we propose a Commonsense Knowledge
and Emotional Concepts Extractor (CKECE) for the GPT-2 decoder, to enable
commonsense-aware and empathetic response generation. In the CKECE, we first
use KeyBERT [7] to extract the keywords from the dialogue context; we then
retrieve the commonsense and emotional concepts of the keywords using the
commonsense knowledge base ConceptNet [34] and the emotion lexicon NRC_VAD
[20]; finally, the extracted concepts are fed into the GPT-2 decoder in a
plain-text format to guide the empathetic generation.


[Figure 1: the speaker utterance and listener response below, annotated with
ConceptNet concepts: <has subevent> enjoy nature; <part of> Australian Alps;
<part of> Australian desert; <is a> bird; <related to> exciting/journey;
<has subevent> chew food; <is a> food.]

Speaker: I was hiking in the Outback of Australia the other day. I managed to
swallow a dozen ostrich eggs on my adventure.

Listener: Wow, do they taste good?


Fig. 1 An example from the EmpatheticDialogues dataset with underlying
commonsense knowledge (blue part) and emotional concepts (red part). (The
special token in < > represents the relation in the commonsense knowledge base
ConceptNet [34].)


2 Related Work 


Open-domain and chit-chat conversational models have been widely studied [30,
38]. With the rise of publicly accessible datasets [9, 15, 27] and data-driven
learning approaches [35, 36], several works have attempted to make chit-chat
dialogue more engaging. Some aim to improve the personalization of responses by
conditioning the generation on a persona profile [11]. The PersonaChat dataset
[42] was then introduced for this purpose, and the ConvAI 2 challenge [5]
demonstrated that adding persona information to the model yields responses with
more consistent personas. However, personalized dialogue models often cannot
take the feelings of their conversation partners into consideration. Besides
the chit-chat research on displaying a consistent personality, some works focus
on emotional and empathetic dialogue generation. The existing emotional
dialogue models [3, 12, 32, 45, 46] generally generate the response depending
on a predefined emotion; empathetic dialogue models, in contrast, are capable
of perceiving the emotion of the speaker and expressing empathy without an
extra step that explicitly determines which emotion type to respond with [33].
Hence, the empathetic dialogue model is more in line with real-world scenarios
[14], because in human-human communication the listener is capable of inferring
the emotion of the speaker.

In recent years, several works have been presented for empathetic dialogue
generation. [27] created a benchmark and dataset towards empathetic open-domain
dialogue. [16] softly combined the possible emotional responses from several
separate experts to generate the final empathetic response. [13] proposed a
multi-resolution interactive empathetic dialogue model to evoke more emotional
perception in dialogue generation. [14] proposed a multi-type knowledge aware
empathetic dialogue generation framework to enhance the empathy of generations.
The above-mentioned approaches are all trained from scratch. [21] proposed
BERT2BERT for Arabic empathetic response generation, in which the encoder and
decoder are both warm-started using pre-trained auto-encoding AraBERT [1]
parameters. [40] introduced EmpTransfo and [17] presented CAiRE, both of which
are empathy-aware models adapted from GPT [24]. The encoder-decoder model
released in Huggingface¹ allows a sequence-to-sequence model to be initialized
with any pre-trained auto-encoding model as the encoder and any pre-trained
auto-regressive model as the decoder. We are therefore interested in the
performance of a pre-trained auto-encoding encoder and auto-regressive decoder
architecture for empathetic dialogue generation. Furthermore, [28] performed an
extensive study on leveraging various pre-trained models for sequence
generation tasks and demonstrated that combining RoBERTa [18] and GPT-2 [25]
achieves strong results. Hence, RoBERTa-GPT2 is proposed in this work for
empathetic dialogue generation.

In addition, corpora with emotion labelling play a significant role in
empathetic dialogue generation. There are several interesting resources. [15]
developed the DailyDialog dataset, with a manually annotated emotion label for
each utterance. [9] collected the EmotionLines dataset from TV shows and
human-to-human chats, where each utterance is further annotated with one of
seven emotion-categorical labels. However, only 5% of the utterances in
DailyDialog and 16.68% in EmotionLines have varied emotional labels; the others
are labelled either “none” or “happy”. Hence, they are not suitable for
empathetic dialogue generation because of the extremely unbalanced data
distribution. [27] released an empathetic dialogue dataset,
EmpatheticDialogues, which focuses explicitly on conversations about
emotionally grounded personal situations and considers a richer, evenly
distributed set of emotions. In our work, we conduct the empathetic dialogue
generation experiments on the EmpatheticDialogues dataset.


3 The Proposed Method 


In this work, we present the RoBERTa-GPT2 encoder-decoder architecture for
empathetic dialogue generation, where the pre-trained auto-encoding RoBERTa
[18] serves as the encoder and the pre-trained auto-regressive GPT-2 [25] as
the decoder. In addition, a Commonsense Knowledge and Emotional Concepts
Extractor (CKECE), which extracts the relevant concepts from the dialogue
history, is proposed to give the GPT-2 decoder commonsense and empathetic
ability. In this section, the CKECE is introduced first, followed by the
RoBERTa-GPT2 architecture with extracted concepts.


1 https://huggingface.co/transformers/model_doc/encoderdecoder.html


3.1 Commonsense Knowledge and Emotional Concepts Extractor: CKECE


The CKECE uses two knowledge sources, the commonsense knowledge base
ConceptNet [34] and the emotional lexicon NRC_VAD [20], and one keyword
extraction tool, KeyBERT [7]. We first use KeyBERT to extract the keywords of
the dialogue context, and then select the most relevant commonsense and
emotional concepts for the keywords using the confidence score of ConceptNet
and the emotional intensity derived from NRC_VAD.


3.1.1 The CKECE components 


The three resources used in CKECE are introduced in the following: 

KeyBERT² is a minimal and easy-to-use keyword extraction technique that
leverages BERT embeddings and cosine similarity to find the keywords and
keyphrases in a document that are most similar to the document itself.

ConceptNet³ is a large-scale, multilingual commonsense knowledge graph that
describes general human knowledge in natural language. It comprises 5.9M
assertions, 3.1M concepts and 38 relations. The nodes in ConceptNet are
concepts and the edges are relations. Each (concept1, relation, concept2)
triplet is an assertion and is associated with a confidence score, usually in
the [1, 10] interval. For example, the assertion (loneliness, CausesDesire,
socialize) has a confidence score of 3.464.

NRC_VAD⁴ is a lexicon that lists more than 20k English words with their
Valence, Arousal, and Dominance (VAD) scores. For a given word and dimension,
the score ranges from 0 (lowest) to 1 (highest). The interpretations of the
NRC_VAD dimensions are presented in Table 2. For example, the VAD score vector
[V_a, A_r, D_o] of the word “happiness” is [0.960, 0.732, 0.850].


Table 2 Interpretations of NRC_VAD dimensions.

Dimensions       Values  Interpretations
Valence (V_a)    [0, 1]  Negative-Positive
Arousal (A_r)    [0, 1]  Calm-Excited
Dominance (D_o)  [0, 1]  Weak-Powerful


2 https://github.com/MaartenGr/KeyBERT 
3 https://conceptnet.io/ 
4 https://saifmohammad.com/WebPages/nrc-vad.html 




3.1.2 CKECE 


To extract more relevant concepts, we first use KeyBERT to extract the keywords
from the dialogue context. In this step, the recommended KeyBERT model
“distilbert-base-nli-mean-tokens” is used, and at most the top 10 keywords with
a score greater than 0 are retained. An example of extracted keywords is shown
in Fig. 2.
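
The retention rule for keywords (rank by similarity to the document, keep at most 10 with a positive score) can be illustrated with a small sketch. This is not the KeyBERT API: the function names and the 3-dimensional toy vectors are hypothetical stand-ins for real BERT embeddings, used only to show the ranking and filtering logic.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_keywords(doc_vec, candidates, max_keywords=10):
    """Rank candidate words by similarity to the document embedding and keep
    at most `max_keywords` of them whose score is strictly greater than 0."""
    scored = [(word, cosine(doc_vec, vec)) for word, vec in candidates.items()]
    scored = [(word, score) for word, score in scored if score > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_keywords]

# Toy embeddings, made up for illustration only (assumption, not real data).
doc = [0.9, 0.8, 0.1]
words = {
    "cancer": [1.0, 0.7, 0.0],
    "cough": [0.6, 0.9, 0.2],
    "lately": [-0.5, -0.4, 0.1],  # negative similarity, so it is dropped
}
keywords = select_keywords(doc, words)
print([word for word, _ in keywords])
```
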

Then, we pick out the commonsense concepts from ConceptNet based on the
keywords and denote them as tuples (keyword, relation, concept, scaled
confidence score), where the confidence score $s$ is scaled with the min-max
normalization in Eq. 1:

$$\text{min-max}(s) = \frac{s - \min_s}{\max_s - \min_s}, \qquad (1)$$

where $\min_s$ is 1 and $\max_s$ is 10. The processed $s \in [0, 1]$; min-max
normalization is also used in [14, 44]. With min-max normalization, the example
(loneliness, CausesDesire, socialize) with confidence score 3.464 in Section
3.1.1 is transformed into the tuple (loneliness, CausesDesire, socialize,
0.274) with scaled confidence score 0.274. In order to pick out the most
relevant concepts, the following tuples are removed in this step:


• The keyword or the concept is a stop word. (The union of the stop words in
  NLTK [19] and SpaCy⁵ is used.)

• The scaled confidence score is less than a pre-defined threshold α. We set
  α = 0.1 in this work, i.e. tuples with s < 0.1 are removed.

• The keyword and the concept are identical or share the same stem, e.g.
  (addition, Synonym, addition, 0.11) or (actual, DerivedFrom, actually, 0.11).

• The relation is in an excluded relation list, i.e. r ∈ [Antonym, ExternalURL,
  NotDesires, NotHasProperty, NotCapableOf, dbpedia, DistinctFrom,
  EtymologicallyDerivedFrom, EtymologicallyRelatedTo, SymbolOf, FormOf,
  AtLocation, DerivedFrom, SymbolOf].
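
The scaling and the four removal rules above can be sketched as follows. The stop-word set and the crude suffix-stripping stemmer are hypothetical stand-ins (the paper uses the NLTK and SpaCy stop words and does not name its stemmer), so the code is only a sketch of the filtering logic under those assumptions.

```python
def min_max(score, lo=1.0, hi=10.0):
    """Min-max normalization with the bounds stated in the text (1 and 10)."""
    return (score - lo) / (hi - lo)

EXCLUDED_RELATIONS = {
    "Antonym", "ExternalURL", "NotDesires", "NotHasProperty", "NotCapableOf",
    "dbpedia", "DistinctFrom", "EtymologicallyDerivedFrom",
    "EtymologicallyRelatedTo", "SymbolOf", "FormOf", "AtLocation", "DerivedFrom",
}

# Assumption: tiny stand-in for the union of NLTK and SpaCy stop words.
STOP_WORDS = {"the", "a", "an", "is", "it"}

def naive_stem(word):
    """Assumption: crude suffix stripping in place of a real stemmer."""
    for suffix in ("ly", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keep_tuple(keyword, relation, concept, confidence, alpha=0.1):
    """Return True if the (keyword, relation, concept) assertion survives."""
    if keyword in STOP_WORDS or concept in STOP_WORDS:
        return False                      # rule 1: stop words
    if min_max(confidence) < alpha:
        return False                      # rule 2: low scaled confidence
    if naive_stem(keyword) == naive_stem(concept):
        return False                      # rule 3: same word or same stem
    if relation in EXCLUDED_RELATIONS:
        return False                      # rule 4: excluded relation
    return True

print(round(min_max(3.464), 3))
print(keep_tuple("loneliness", "CausesDesire", "socialize", 3.464))
```
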


Furthermore, to obtain emotional concepts, we adopt NRC_VAD to compute the
emotion intensity value of a concept $c$ as in Eq. 2:

$$\eta(c) = \text{min-max}\left(\left\lVert V_a(c) - \tfrac{1}{2},\ \tfrac{A_r(c)}{2} \right\rVert_2\right), \qquad (2)$$

where $\lVert \cdot \rVert_2$ denotes the $\ell_2$ norm, and $V_a(c)$ and
$A_r(c)$ represent the valence and arousal scores of concept $c$, respectively.
When $c$ is not in NRC_VAD, we set $\eta(c)$ to the mid value of 0.5.
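
A minimal sketch of the emotion intensity computation, assuming Eq. 2 has the form used in [14, 44], i.e. the ℓ2 norm of the vector (V_a − 1/2, A_r/2) followed by min-max scaling. The text does not state the bounds used for this scaling; the sketch assumes the theoretical range of the norm for scores in [0, 1], which is an assumption rather than the authors' setting.

```python
import math

# Assumption: scale by the theoretical maximum of the norm, reached when
# |V_a - 0.5| = 0.5 and A_r = 1; the theoretical minimum is 0.
MAX_NORM = math.sqrt(0.5 ** 2 + 0.5 ** 2)

def emotion_intensity(word, vad_lexicon, default=0.5):
    """Emotion intensity of a concept from its valence and arousal scores.
    Words missing from the lexicon get the mid value 0.5, as in the text."""
    if word not in vad_lexicon:
        return default
    valence, arousal, _dominance = vad_lexicon[word]
    norm = math.hypot(valence - 0.5, arousal / 2.0)
    return norm / MAX_NORM

# VAD vector of "happiness" as quoted in Section 3.1.1.
lexicon = {"happiness": (0.960, 0.732, 0.850)}
print(emotion_intensity("happiness", lexicon))
print(emotion_intensity("unlisted-word", lexicon))
```
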

5 https://github.com/explosion/spaCy

Lastly, the final score $F$ in Eq. 3 is derived from three aspects: emotion
intensity, semantic similarity and scaled confidence score. For a keyword $k_i$
and a candidate concept $c_k^i$ with scaled confidence score $s_k^i$,

$$F(k_i, c_k^i) = \eta(c_k^i) + \cos(k_i, c_k^i) + s_k^i, \qquad (3)$$

where the semantic similarity $\cos(k_i, c_k^i)$ is the cosine similarity
between the keyword and the concept, both embedded with GloVe [23] (global
vectors for word representation), an unsupervised learning algorithm for
obtaining vector representations of words.

We sort the candidate tuples in descending order of their final scores and
select the top 3 tuples for each keyword. At most 10 tuples are chosen for
every dialogue context. The extracted concepts are then arranged in a
plain-text form, “keyword <relation> concept”, as shown in Fig. 2, for the
GPT-2 decoder.
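
The ranking and formatting step can be sketched as below. The final score is the sum of the three terms named in the text. Two details are assumptions, since the text does not specify them: the per-context limit of 10 is applied by keeping the highest-scoring tuples overall, and relation names such as `IsA` are turned into `<is a>` by splitting on capital letters. The candidate values are invented for illustration.

```python
import re
from collections import defaultdict

def final_score(intensity, similarity, scaled_confidence):
    """Sum of emotion intensity, cosine similarity and scaled confidence."""
    return intensity + similarity + scaled_confidence

def relation_to_text(relation):
    """Assumption: 'RelatedTo' -> 'related to', 'IsA' -> 'is a'."""
    return re.sub(r"(?<!^)(?=[A-Z])", " ", relation).lower()

def select_concepts(candidates, per_keyword=3, per_context=10):
    """Keep the top tuples per keyword, then cap the total per context."""
    by_keyword = defaultdict(list)
    for cand in candidates:
        score = final_score(cand["intensity"], cand["similarity"], cand["confidence"])
        by_keyword[cand["keyword"]].append((score, cand))
    selected = []
    for items in by_keyword.values():
        items.sort(key=lambda pair: pair[0], reverse=True)
        selected.extend(items[:per_keyword])
    selected.sort(key=lambda pair: pair[0], reverse=True)
    return [
        f'{c["keyword"]} <{relation_to_text(c["relation"])}> {c["concept"]}'
        for _, c in selected[:per_context]
    ]

def cand(keyword, relation, concept, intensity, similarity, confidence):
    return dict(keyword=keyword, relation=relation, concept=concept,
                intensity=intensity, similarity=similarity, confidence=confidence)

# Hypothetical scores, for illustration only.
candidates = [
    cand("fear", "IsA", "panic", 0.9, 0.6, 0.3),
    cand("fear", "RelatedTo", "scared", 0.8, 0.7, 0.2),
    cand("fear", "RelatedTo", "afraid", 0.7, 0.6, 0.2),
    cand("fear", "RelatedTo", "emotion", 0.5, 0.3, 0.1),
    cand("cancer", "IsA", "disease", 0.5, 0.5, 0.4),
]
concepts = select_concepts(candidates)
print("; ".join(concepts))
```
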


Dialogue context
“I am being in fear lately.”, “oh no! Any particular reason why?”,
“I started to cough blood 3 days ago and I fear it must be cancer.”

Keywords (KeyBERT)
“cancer”, “cough”, “fear”, “blood”

Extracted concepts (ConceptNet + NRC_VAD)
fear <is a> panic; fear <related to> scared; cough <related to> sneeze;
cancer <is a> disease; blood <part of> blood cell; ...

Gold response
That’s horrible! But, it could be many other things instead. I hope you
go to the doctor.


Fig. 2 An example of the CKECE process for a dialogue context. The extracted
emotional concepts and the emotional word in the gold response are marked in
red. The blue parts of the extracted concepts and the gold response share the
same commonsense knowledge.


3.2 Pre-trained RoBERTa-GPT2 encoder-decoder 


RoBERTa [18] and GPT-2 [25] are both large architectures pre-trained on large
collections of text. Such pre-trained models are widely fine-tuned for
downstream tasks. In this work, we explore the pre-trained RoBERTa-GPT2 as an
encoder-decoder architecture for empathetic dialogue generation.


3.2.1 The preliminaries of RoBERTa-GPT2


The pre-trained auto-encoding RoBERTa and pre-trained auto-regressive GPT-2 are
introduced in the following:

RoBERTa⁶ has the same architecture as BERT [4], but uses a byte-level Byte-Pair
Encoding (BPE) [29] tokenizer (the same as GPT-2) and an improved training
procedure.

GPT-2⁷ is a pre-trained large-scale unsupervised language model which generates
coherent paragraphs of text. GPT-2 is also widely used in task-oriented
dialogue generation [2, 22] and chit-chat dialogue generation [17, 43].


3.2.2 RoBERTa-GPT2


Fig. 3 shows our proposed RoBERTa-GPT2 encoder-decoder architecture for
empathetic dialogue generation. For simplicity, the inputs of the RoBERTa
encoder and the GPT-2 decoder in Fig. 3 show only the initial part of each
sentence. Fig. 2 and Fig. 3 share the same dialogue example.

The pre-trained RoBERTa encoder processes the dialogue context, where the
<CLS> token is prepended at the first position and <SEP> separates the speaker
and listener utterances. The output of the <CLS> token, the pooled output,
represents the meaning of the entire input. A linear layer with softmax
activation is added on top of the pooled output for emotion classification.
The encoder outputs are fed to the GPT-2 decoder through the cross-attention
mechanism. As shown in Fig. 3, the input of the GPT-2 decoder starts with the
extracted concepts. During training, the gold response is attached after the
concepts for faster convergence, separated by a <SEP> token. It is noteworthy
that only the response part, without the extracted concepts, is treated as the
output of the GPT-2 decoder when computing the generation loss during training.
That is, by combining the pre-trained RoBERTa and GPT-2, the response is
generated conditioned on the contextual information of the encoder outputs via
cross-attention and on the emotional concepts in the decoder input via
self-attention. Lastly, all the parameters of RoBERTa-GPT2 are jointly trained
end-to-end to optimize emotion classification and response generation by
minimising the emotion cross-entropy loss and the maximum likelihood estimation
(MLE) generation loss.
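
The input layout and the restriction of the generation loss to the response part can be sketched in plain Python. This is only an illustration under stated assumptions: tokens are whitespace-separated strings rather than BPE ids, the `<END>` token is taken from Fig. 3, the two losses are added with equal weight (the text does not give a weighting), and all function names are hypothetical.

```python
def build_encoder_input(utterances):
    """Dialogue context as '<CLS> u1 <SEP> u2 <SEP> ... <SEP>'."""
    return "<CLS> " + " <SEP> ".join(utterances) + " <SEP>"

def build_decoder_sequence(concept_tokens, response_tokens):
    """Decoder sequence 'concepts <SEP> response <END>' plus a 0/1 mask that
    is 1 only on positions counted in the generation loss (the response)."""
    tokens = concept_tokens + ["<SEP>"] + response_tokens + ["<END>"]
    loss_mask = [0] * (len(concept_tokens) + 1) + [1] * (len(response_tokens) + 1)
    return tokens, loss_mask

def masked_nll(token_log_probs, loss_mask):
    """Mean negative log-likelihood over the unmasked (response) positions."""
    total = -sum(lp * m for lp, m in zip(token_log_probs, loss_mask))
    return total / max(1, sum(loss_mask))

def joint_loss(emotion_log_prob, token_log_probs, loss_mask):
    """Assumption: emotion cross-entropy and generation loss are summed."""
    return -emotion_log_prob + masked_nll(token_log_probs, loss_mask)

encoder_input = build_encoder_input(["i am being in fear lately.", "oh no!"])
tokens, mask = build_decoder_sequence(
    ["fear", "<is a>", "panic"], ["that", "is", "horrible"]
)
print(encoder_input)
print(list(zip(tokens, mask)))
```
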


6 https://github.com/pytorch/fairseq/tree/master/examples/roberta
7 https://github.com/openai/gpt-2


[Figure 3: the RoBERTa encoder receives the dialogue context
“<CLS> i am ... <SEP> oh no! ... <SEP> i started ... <SEP>” and feeds an
emotion classifier that outputs the emotion logits; the GPT-2 decoder receives
the concepts extracted by CKECE and generates the response
“that’s horrible, ... <END>”.]


Fig. 3 Our proposed RoBERTa-GPT2 encoder-decoder architecture with CKECE guidance for
empathetic dialogue generation.


4 Experimental Settings and Results Analysis 


4.1 Dataset 


We conduct our experiments on the large-scale multi-turn EmpatheticDialogues
dataset [27], which consists of 25k one-to-one open-domain conversations
grounded in emotional situations. The dataset provides 32 evenly distributed
emotion labels.


4.2 Baselines 


We compare our models with the following four baselines. 


1) Transformer [36]: a Transformer-based encoder-decoder model trained with 
MLE generation loss. 

2) EmoPrepend-1 [27]: an extension of Transformer model with an additional su- 
pervised emotion classifier. The whole model is jointly trained by optimizing 
both the classification and generation loss. 

3) MoEL [16]: another extension of Transformer model, which softly combines the 
outputs of the multiple listeners. Each listener is optimized to react to a certain 
emotion and generate an empathetic response. 

4) MK-EDG [14]: a multi-type knowledge aware empathetic dialogue generation 
framework. Commonsense knowledge and emotional lexicon are used to enrich 
the dialogue utterance. 


Additionally, to better analyse our proposed RoBERTa-GPT2 architecture for
empathetic dialogue generation, we also evaluate two variants: RoBERTa w/o
GPT-2, i.e. only the RoBERTa encoder with the emotion classifier, trained with
the emotion loss; and RoBERTa-GPT2 w/o CKECE, i.e. RoBERTa-GPT2 without the
guidance of external knowledge.


4.3 Training details 


RoBERTa-GPT2 is trained with batch size 16 and learning rate 1e-5. Early
stopping is applied during training to save the best model. During decoding,
we use the top-k [6] and nucleus sampling (top-p) [8] decoding algorithms with
top-k equal to 5 and top-p equal to 0.9.
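
Combined top-k and nucleus sampling with the stated settings (k = 5, p = 0.9) can be sketched for a single decoding step as follows. The order of the two filters (top-k first, then top-p on the renormalized distribution) is one common convention and is assumed here; the text does not state which implementation was used, and the toy distribution is invented.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(logits, top_k=5, top_p=0.9):
    """Return {token_id: probability} after top-k then nucleus filtering."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass   # renormalized after the top-k cut
        if cumulative >= top_p:           # smallest prefix reaching top_p
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

def sample_next_token(logits, rng=random, top_k=5, top_p=0.9):
    dist = top_k_top_p_filter(logits, top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

# Toy next-token distribution over a 6-word vocabulary (illustrative only).
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.1, 0.05, 0.03, 0.02)]
print(top_k_top_p_filter(toy_logits))
print(sample_next_token(toy_logits, rng=random.Random(0)))
```
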


4.4 Automatic Evaluation Results 


To evaluate the performance of the RoBERTa-GPT2 model, we first adopt Emotion
Accuracy, defined as the agreement between the ground truth emotion labels and
the emotion labels predicted by the emotion classifier. In addition, Perplexity
[31] is used to measure the high-level general quality of the generation model.
Furthermore, Distinct-1 and Distinct-2 [10] measure the proportion of distinct
unigrams and bigrams in all the generated results as an indicator of diversity.
Table 3 shows the evaluation results of our proposed methods and the baselines.
The results of MK-EDG in Table 3 are copied directly from [14]; hence MK-EDG is
absent from the use cases in Table 4.
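
The Emotion Accuracy and Distinct-n definitions above can be sketched as follows. Whitespace tokenization and lowercasing are assumptions; the sketch returns ratios in [0, 1], and the text does not state whether the Distinct values in Table 3 are reported on a different scale.

```python
def emotion_accuracy(gold_labels, predicted_labels):
    """Fraction of examples whose predicted emotion matches the gold label."""
    matches = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return matches / len(gold_labels) if gold_labels else 0.0

def distinct_n(responses, n):
    """Number of distinct n-grams divided by the total number of n-grams
    over all generated responses."""
    total, unique = 0, set()
    for response in responses:
        tokens = response.lower().split()   # assumption: whitespace tokens
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

generated = ["that sounds like fun", "that sounds great"]
print(distinct_n(generated, 1), distinct_n(generated, 2))
print(emotion_accuracy(["excited", "lonely"], ["excited", "sad"]))
```
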


Table 3 Evaluation results of RoBERTa-GPT2 and baselines


Models                    Emotion Accuracy  Perplexity  Distinct-1  Distinct-2
Transformer               -                 35.56       0.41        1.49
EmoPrepend-1              0.3359            35.66       0.42        1.62
MoEL                      0.3425            37.69       0.43        1.72
MK-EDG                    0.3931            34.85       1.48        4.90
RoBERTa w/o GPT-2         0.3439            -           -           -
RoBERTa-GPT2 w/o CKECE    0.5262            14.97       1.62        10.47
RoBERTa-GPT2              0.5151            13.57       2.04        11.68


In Table 3, we observe that our proposed RoBERTa-GPT2 outperforms the baselines
by a large margin on all automatic metrics and achieves a new state-of-the-art
result. The emotion accuracy of RoBERTa w/o GPT-2 (0.3439) compared with that
of RoBERTa-GPT2 (0.5151) demonstrates that emotion classification benefits from
combining the RoBERTa encoder with the GPT-2 decoder. RoBERTa-GPT2 has better
Perplexity, Distinct-1 and Distinct-2 than RoBERTa-GPT2 w/o CKECE, indicating
that injecting external knowledge contributes to the generation quality, at the
cost of a small drop in emotion accuracy (0.5151 vs. 0.5262).


4.5 Use case study


Table 4 Generated responses from Transformer, EmoPrepend-1, MoEL and RoBERTa-GPT2. The
underlined part shows the empathetic and contextual content of the RoBERTa-GPT2 response.


Emotion           excited
Dialogue context  Speaker: I’m taking a vacation and going on a cruise
                  to Alaska next week!
Transformer       Where are you going?
EmoPrepend-1      That is awesome! Where are you going?
MoEL              I hope you have a great time!
RoBERTa-GPT2      I am sure you will enjoy the cruise!

Emotion           terrified
Dialogue context  Speaker: Do you even know how crazy it is to skydive? I must have been
                  absolutely nuts when I did it.
                  Listener: Oh I don’t think I could do that. I have a fear of falling
                  from high places.
                  Speaker: It gave me the biggest rush that’s for sure. But on the way
                  down I was saying my prayers.
Transformer       That is a great idea. I have been there.
EmoPrepend-1      I would be too. I do not know what you mean.
MoEL              I am sorry to hear that. I hope everything works out for you.
RoBERTa-GPT2      That is a lot of stress! I am glad you did not get injured.

Emotion           excited
Dialogue context  Speaker: I couldn’t wait to go to the concert.
Transformer       That sounds like a lot of fun!
EmoPrepend-1      That sounds like a lot of fun!
MoEL              That sounds like a lot of fun!
RoBERTa-GPT2      That is awesome! What kind of concert will it be?

Emotion           trusting
Dialogue context  Speaker: I let my daughter ride in the road by herself on her
                  skateboard the other day.
Transformer       That is awesome! What was it?
EmoPrepend-1      That must have been a lot of fun!
MoEL              oh no! What happened?
RoBERTa-GPT2      Wow! I would be a little nervous if she did not skate at all.


Responses generated by the baselines Transformer, EmoPrepend-1 and MoEL and by
our proposed method, RoBERTa-GPT2, are listed in Table 4. In the first case,
Transformer and EmoPrepend-1 do not fully understand what the speaker is
saying. MoEL identifies the user emotion, but its response is rather generic.
Besides correctly understanding the user emotion, RoBERTa-GPT2 also knows that
the speaker is talking about a “cruise”. The baselines in the second case do
not correctly recognise the user emotion. Compared with the generic responses
of the baselines in the third case, RoBERTa-GPT2 generates a contextual
response with a proper positive emotion by replying with “awesome”. In the
fourth case, the response of EmoPrepend-1 is generic and the other two
baselines do not understand the speaker, while RoBERTa-GPT2 generates a
coherent and informative response by showing concern. All the cases in Table 4
show that our proposed RoBERTa-GPT2 can handle both the user emotion and the
dialogue content.


5 Conclusion and Outlook 


In this work, we leverage the pre-trained auto-encoding RoBERTa as the encoder
and the pre-trained auto-regressive GPT-2 as the decoder for empathetic
dialogue generation. In addition, external knowledge, namely commonsense
knowledge and an emotional lexicon, is used to extract emotional and
commonsense concepts from the dialogue context for the GPT-2 decoder, enabling
empathetic and contextual responses. Both the automatic metrics and the case
study show that our proposed RoBERTa-GPT2 outperforms the baselines and
demonstrate that empathetic dialogue generation benefits from pre-trained
modelling and external knowledge.

In future work, we will further evaluate our proposed method for empathetic
dialogue generation from a human perspective. We are also interested in other
flexible methods for injecting external knowledge into empathetic dialogue
systems.